Bicriteria data compression

Authors

  • Andrea Farruggia
  • Paolo Ferragina
  • Antonio Frangioni
  • Rossano Venturini
Abstract

The advent of massive datasets and the consequent design of high-performing distributed storage systems (such as BigTable by Google [7], Cassandra by Facebook [5], and Hadoop by Apache) have reignited the interest of the scientific and engineering community in the design of lossless data compressors that achieve an effective compression ratio and very efficient decompression speed. Lempel-Ziv's LZ77 algorithm is the de facto choice in this scenario because its decompression is significantly faster than that of other approaches, and its algorithmic structure is flexible enough to trade decompression speed against compressed-space efficiency. This algorithm has been implemented in many variants, the most famous being the classic gzip, LZ4, and Google's Snappy. Each of these implementations offers a trade-off between space occupancy and decompression speed, so software engineers have to content themselves with picking the one that comes closest to the requirements of the application at hand. Starting from these premises, and for the first time in the literature, we address in this paper the problem of trading these two resources optimally and in a principled way, by introducing and solving what we call the Bicriteria LZ77-Parsing problem. The goal is to determine an LZ77 parsing that minimizes the space occupancy in bits of the compressed file, provided that the decompression time is bounded by a fixed amount. Symmetrically, we can exchange the roles of the two resources and ask to minimize the decompression time provided that the compressed space is bounded by a fixed amount. This way, software engineers can set their space (or time) requirements and then derive the LZ77 parsing that optimizes the decompression speed (or the space occupancy, respectively), thus obtaining the best possible LZ77 compression under those constraints.
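
The two symmetric variants of the problem stated above can be written compactly as constrained minimizations. The notation is an assumption made for this sketch and is not taken from the paper: $\Pi(S)$ is the set of LZ77 parsings of the input $S$, $s(\pi)$ is the size in bits of the file compressed with parsing $\pi$, $t(\pi)$ is its decompression time, and $T$ and $B$ are the user-chosen time and space bounds.

```latex
% Space-optimal parsing under a decompression-time bound T
\min_{\pi \in \Pi(S)} s(\pi) \quad \text{subject to} \quad t(\pi) \le T

% Symmetric variant: time-optimal parsing under a space bound B
\min_{\pi \in \Pi(S)} t(\pi) \quad \text{subject to} \quad s(\pi) \le B
```
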
We solve this problem in four stages: we turn it into a form of the weight-constrained shortest path problem (WCSPP) over a weighted graph derived from the LZ77 parsing of the input file; we argue that known solutions for WCSPP are inefficient and thus unusable in practice; we prove some structural properties of that graph; and we then design an O(n log n)-time algorithm that computes a small additive approximation of the optimal LZ77 parsing. This additive approximation is logarithmic in the input size and thus negligible in practice. Finally, we support these arguments with experiments showing that our algorithm combines the best properties of known compressors: its decompression time is close to that of the fastest ones, Snappy and LZ4, and its compression ratio is close to that of the more succinct bzip2 and LZMA. In fact, in many cases our compressor improves on the best known engineered solutions mentioned above, so we can safely state that our result gives software engineers an algorithmic knob to trade, automatically and in a principled way, the time/space requirements of their applications. In summary, the three main contributions of the paper are: (i) we introduce the novel Bicriteria LZ77-Parsing problem, which formalizes in a principled way what data compressors have traditionally approached by means of heuristics; (ii) we solve this problem efficiently in O(n log n) time and optimal linear space, by proving and deploying some specific structural properties of the weighted graph derived from the possible LZ77 parsings of the input file; (iii) we run a preliminary set of experiments showing that our proposal dominates all the highly engineered competitors, hence offering a win-win situation in theory and practice.

arXiv:1307.3872v1 [cs.IT] 15 Jul 2013
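
To make the weight-constrained shortest path formulation mentioned in the abstract concrete, here is a minimal Python sketch. It is not the paper's O(n log n) algorithm: it is the textbook pseudo-polynomial dynamic program for WCSPP on a DAG, which is the kind of generic solution the abstract describes as too slow for real inputs. Everything specific is an assumption made for illustration: nodes 0..n stand for text positions, an edge (src, dst) stands for one LZ77 phrase covering text[src:dst], and the function name `min_space_parsing`, the `toy_edges` graph, and its space and time costs are all hypothetical.

```python
from math import inf


def min_space_parsing(n, edges, time_budget):
    """Return (space, time, path) of a minimum-space path from node 0 to node n
    whose total time is at most time_budget, or None if no such path exists.

    edges: iterable of (src, dst, space_bits, time_units) with src < dst and
    non-negative integer time_units. time_budget is a non-negative integer.
    All costs are hypothetical stand-ins for real phrase costs.
    """
    out = [[] for _ in range(n + 1)]
    for src, dst, space, time in edges:
        out[src].append((dst, space, time))

    # best[v][t]: least space over paths 0 -> v with total time exactly t.
    best = [[inf] * (time_budget + 1) for _ in range(n + 1)]
    prev = [[None] * (time_budget + 1) for _ in range(n + 1)]
    best[0][0] = 0

    # Edges only go forward, so increasing node order is a topological order.
    for u in range(n):
        for t in range(time_budget + 1):
            if best[u][t] == inf:
                continue
            for v, space, time in out[u]:
                nt = t + time
                if nt <= time_budget and best[u][t] + space < best[v][nt]:
                    best[v][nt] = best[u][t] + space
                    prev[v][nt] = (u, t)

    # Pick the cheapest way to reach the last node within the budget.
    t_best = min(range(time_budget + 1), key=lambda t: best[n][t])
    if best[n][t_best] == inf:
        return None

    # Walk the predecessor links back to node 0 to recover the phrases.
    path, v, t = [], n, t_best
    while v != 0:
        u, pt = prev[v][t]
        path.append((u, v))
        v, t = u, pt
    path.reverse()
    return best[n][t_best], t_best, path


# Toy graph for a 4-symbol text: four literal phrases, one nearby copy that is
# fast to decode, and one far copy that is smaller overall but slow to decode.
toy_edges = [(i, i + 1, 9, 1) for i in range(4)]
toy_edges += [(2, 4, 14, 1), (1, 4, 20, 6)]

print(min_space_parsing(4, toy_edges, 7))  # loose time budget
print(min_space_parsing(4, toy_edges, 5))  # tighter time budget
print(min_space_parsing(4, toy_edges, 2))  # budget too small for any parsing
```

With the loose budget the sketch picks the parsing that uses the slow far copy (29 bits, 7 time units); with the tighter budget it falls back to the nearby copy (32 bits, 3 time units); with the smallest budget no parsing fits and it returns None. The table has (n + 1) × (time_budget + 1) entries, so the cost grows with the budget itself, which illustrates why a generic WCSPP solver does not scale to large files.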


Similar resources

A Submodular Optimization Approach to Bicriteria Scheduling Problems with Controllable Processing Times on Parallel Machines

In this paper, we present a general methodology for designing polynomial-time algorithms for bicriteria scheduling problems on parallel machines with controllable processing times. For each considered problem, the two criteria are the makespan and the total compression cost, and the solution is delivered in the form of the break points of the efficient frontier. We reformulate the scheduling pr...


Bicriteria Resource Allocation Problem in Pert Networks

We develop a bicriteria model for the resource allocation problem in PERT networks, in which the total direct costs of the project as the first objective, and the mean of project completion time as the second objective are minimized. The activity durations are assumed to be independent random variables with either exponential or Erlang distributions, in which the mean of each activity duration ...


Single Machine Scheduling with Controllable Processing Times by Submodular Optimization

In scheduling with controllable processing times the actual processing time of each job is to be chosen from the interval between the smallest (compressed or fully crashed) value and the largest (decompressed or uncrashed) value. In the problems under consideration, the jobs are processed on a single machine and the quality of a schedule is measured by two functions: the maximum cost (that depe...


Linear Path Skyline Computation in Bicriteria Networks

A bicriteria network is an interlinked data set where edges are labeled with two cost attributes. An example is a road network where edges represent road segments being labeled with traversal time and energy consumption. To measure the proximity of two nodes in network data, the common method is to compute a cost optimal path between the nodes. In a bicriteria network, there often is no unique ...


The polynomial solvability of selected bicriteria scheduling problems on parallel machines with equal length jobs and release dates

We consider bicriteria scheduling problems involving several classical well-known scheduling objectives on parallel, identical machines with job release dates. The jobs are assumed to have equal processing times. Our bicriteria treatment of these problems includes lexicographic optimization, minimization of a composite linear function, generation of schedules on the efficient frontier and the g...


The bicriteria minimum spanning tree problem

Let G = (V,E) be an undirected graph with a weight function and a cost function on edges. The bicriteria minimum spanning tree problem is concerned with the determination of a minimum cost spanning tree T in G subject to the constraint that the total weight in T is at most a given bound B. In this paper, we present two polynomial time approximation schemes (PTASs) for the bicriteria minimum spa...




Publication date: 2014